FIX Support errors in MultiPromptSendingAttack, add safe completion support to SelfAskRefusalScorer #1366

Merged
jsong468 merged 12 commits into Azure:main from fdubut:bug_fixes
Feb 18, 2026
@fdubut (Contributor) commented Feb 12, 2026

Description

A couple of fixes:

  • Support content moderation errors in MultiPromptSendingAttack. Currently, the attack fails with an uncaught exception if one of the intermediate prompts returns a moderation error. With the fix, it fails gracefully.
  • Support safe completions in SelfAskRefusalScorer. Currently, the scorer leans towards "not a refusal" if the model returns a safe completion (which most modern post-GPT-5 models will do). With the added option, safe completions are considered a refusal. The default is unchanged; this is an additional template that users can select when they instantiate the scorer.
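
To illustrate the first fix, here is a minimal, self-contained sketch of failing gracefully when an intermediate prompt is blocked. It is not the PyRIT implementation: `ContentModerationError`, `AttackOutcome`, and `run_prompt_sequence` are hypothetical names chosen for this example, and the real attack class may report failures differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class ContentModerationError(Exception):
    """Hypothetical stand-in for the error a target raises when a prompt is filtered."""


@dataclass
class AttackOutcome:
    """Hypothetical result type: records how far the sequence got and why it stopped."""
    succeeded: bool
    responses: List[str]
    failed_prompt_index: Optional[int] = None
    reason: Optional[str] = None


def run_prompt_sequence(prompts: List[str], send: Callable[[str], str]) -> AttackOutcome:
    """Send prompts in order; stop and report a failure instead of raising."""
    responses: List[str] = []
    for index, prompt in enumerate(prompts):
        try:
            responses.append(send(prompt))
        except ContentModerationError as exc:
            # Return a structured failure so the caller sees which prompt was blocked.
            return AttackOutcome(False, responses, index, str(exc))
    return AttackOutcome(True, responses)
```

The caller can then inspect `failed_prompt_index` and `reason` rather than handling an uncaught exception from the middle of the sequence.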

Tests and Documentation

  • Added one test to verify that SelfAskRefusalScorer raises an exception when no objective is provided and safe completions are disallowed. The scorer needs the objective to assess whether a response is a "safe" completion or a "true" completion.
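
The check described above could look roughly like the following sketch. All names here (`RefusalMode`, `select_refusal_template`) and the template strings are placeholders invented for illustration; they are not the scorer's actual parameters or prompts, and the exception type used by the real scorer is an assumption.

```python
from enum import Enum
from typing import Optional


class RefusalMode(Enum):
    """Hypothetical switch between the existing template and the stricter one."""
    DEFAULT = "default"  # existing behavior, unchanged
    SAFE_COMPLETION_AS_REFUSAL = "safe_completion_as_refusal"


def select_refusal_template(mode: RefusalMode, objective: Optional[str]) -> str:
    """Pick a scoring template, requiring an objective in the stricter mode."""
    if mode is RefusalMode.SAFE_COMPLETION_AS_REFUSAL:
        if not objective:
            # Without the objective there is nothing to compare the response against.
            raise ValueError("An objective is required when safe completions count as refusals.")
        # Placeholder template text, not the wording used by the real scorer.
        return f"Does the response fulfill this objective? Objective: {objective}"
    return "Is the response a refusal?"
```

A test in the spirit of the one added in this PR would call the function in the stricter mode with `objective=None` and assert that it raises.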

@jsong468 jsong468 merged commit 8f923e2 into Azure:main Feb 18, 2026
29 checks passed